Random Forest Analysis of Heart Failure Dataset

Sierra Landacre, Tim Leschke, Pamela Mishaw, and Pallak Singh

Random Forest

  • Random Forest is a promising machine learning model because it classifies large datasets accurately, is resistant to outliers, and is easy to use.
  • Random Forest is a group of decision trees (a forest) that are created from identically distributed, independent random samples of data drawn with replacement from the original dataset (Breiman 2001).
  • Leo Breiman proposed the Random Forest model to address poor partitioning, sensitivity to noise, and long compute times (Breiman 2001).
  • An important feature of Random Forest is its ability to measure how strongly a given variable is associated with a classification result.

Random Forest Continued

  • Random Forest provides valuable insight into feature importance, helping researchers understand which variables contribute most to predictions.
  • This attribute is critical in applications where identifying key factors is paramount, such as genetic markers in genome-wide association studies, environmental variables in ecological analyses, and clinical variables in heart failure classification.
  • It is an ideal algorithm for studies that involve numerous features such as those found in the biomedical sciences.
  • Its methodology is also more easily conveyed and understood by medical professionals than many other machine learning models, allowing for greater promise in successful real-world implementation.

Random Forest Success Stories

  • Random Forest is used to accurately predict in-hospital mortality for acute kidney injury patients in intensive care (Lin, Hu, and Kong 2019).
  • Random Forest is used to accurately predict future estimated glomerular filtration rates using electronic medical record data (Zhao, Gu, and McDermaid 2019).
  • Random Forest is used to accurately classify gait as displaying or lacking the characteristic of hemiplegia (Luo et al. 2020).
  • Random Forest is used to accurately classify persons with type 1 and type 2 diabetes (Rachmawanto et al. 2021).

Methods

Single classification tree.

Random Forest uses multiple classification trees.

Methods

The Gini Index is referred to as a measure of node purity (James et al. 2021). It can also be used to measure the importance of each predictor. The Gini Index is defined by the following formula, where K is the number of classes and \({\hat{p}_{mk}}\) is the proportion of observations in the mth region that are from the kth class. A Gini Index of 0 represents perfect purity.

\[G=\sum_{k=1}^{K} \hat{p}_{mk}(1-\hat{p}_{mk})\]
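As a quick illustration, the Gini Index of a single node can be computed directly from its class proportions. This is a minimal sketch, and the function name is ours:

```python
def gini_index(proportions):
    """Gini Index: the sum over classes k of p_mk * (1 - p_mk),
    given the class proportions within one node (region m)."""
    return sum(p * (1 - p) for p in proportions)

# A pure node (all observations in one class) has index 0.
print(gini_index([1.0, 0.0]))   # -> 0.0
# A maximally mixed two-class node has index 0.5.
print(gini_index([0.5, 0.5]))   # -> 0.5
```

Splits that lower the total Gini Index across the resulting child nodes are preferred when growing each tree.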

Bagging is the aggregation of the results from each decision tree. It is defined by the following formula, where B is the number of bootstrapped training sets and \(\hat{f}^{*b}\) is the model fit to the bth bootstrapped set. Although bagging improves prediction accuracy, it makes the results harder to interpret, as they cannot be visualized as easily as a single decision tree (James et al. 2021).

\[\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B}\hat{f}^{*b}(x)\]
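The bagging formula can be sketched in a few lines: each model is fit to its own bootstrap sample, and the bagged prediction averages the individual predictions (for classification, Random Forest instead takes a majority vote across trees). This is a minimal illustration, and the function names are ours:

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) observations with replacement; the rows never drawn
    form the out-of-bag (OOB) set for this tree."""
    n = len(data)
    idx = [rng.randrange(n) for _ in range(n)]
    oob = set(range(n)) - set(idx)
    return [data[i] for i in idx], oob

def bag_predict(models, x):
    """Bagged prediction: the average of the B individual model predictions."""
    return sum(m(x) for m in models) / len(models)

# Two stub "models" standing in for fitted trees:
models = [lambda x: x + 1, lambda x: x + 3]
print(bag_predict(models, 0))   # -> 2.0
```

The out-of-bag rows returned by `bootstrap_sample` are what the OOB error rate discussed later is computed on.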

Analysis and Results

  • We ingest the data into RStudio
  • Perform classification with Random Forest
  • Perform analysis of results

Data Used

  • The dataset comes from the Faisalabad Institute of Cardiology and the Allied Hospital in Faisalabad, Pakistan.
  • There are 299 patient records with 13 features per record.
  • A Random Forest model is used to predict heart failure events for these patients.

Model Evaluation Metrics

  • Out-of-bag (OOB) error rate/accuracy - OOB data are the observations left out of a given tree's bootstrap sample, so they serve as a built-in test set for that tree.
  • Confusion Matrix - shows True Positive, False Positive, False Negative and True Negative values to support performance evaluation
  • Precision - also known as positive predictive value; it represents how many observations labeled positive are actually positive.
  • Recall - also known as sensitivity; it quantifies how many actual positive observations are predicted as positive.
  • F1 - harmonic mean of the precision and recall; assesses predictive performance.
  • Balanced accuracy - the average of the true positive rate and the true negative rate (sensitivity and specificity).
  • AUC-ROC - area under the curve created by plotting the true positive rate against the false positive rate.
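Taking "positive" to mean a heart failure event, the metrics listed above can be computed directly from confusion-matrix counts. This is a minimal sketch, and the function name is ours:

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive the evaluation metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)             # positive predictive value
    recall = tp / (tp + fn)                # sensitivity / true positive rate
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)           # true negative rate
    balanced_accuracy = (recall + specificity) / 2
    return {"precision": precision, "recall": recall,
            "f1": f1, "balanced_accuracy": balanced_accuracy}

# Example counts (hypothetical, not from the heart failure dataset):
m = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(round(m["precision"], 3), round(m["recall"], 3))   # -> 0.8 0.667
```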

Examining the Dataset

Variable Importance Plot

FourFold Plot (Confusion Matrix)

Confusion Matrix Heatmap

Variable Correlation Heatmap

ROC - Default Values

trainControl() - Random Selection

Variable Importance Plots

Default

Tuned

FourFold Plot - Default vs. Tuned

Kappa Plot

Tuned vs. Default ROC

Model Results

  • The testing accuracies of both models are slightly lower than their respective training accuracies, which indicates mild overfitting to the training dataset.
  • Model 2 outperforms Model 1: its training and testing accuracies, precision, F1, and balanced accuracy are all higher than those of the default model.

Conclusion

  • The Random Forest model is applied to a heart failure dataset of patients from Faisalabad, Pakistan.
  • The Random Forest model provides strong classification results with its default settings.
  • Random Forest makes its decisions in a black box, which means we do not fully understand how it classifies data or which variables drive each decision.
  • We do know from a review of the variable importance plot that time before follow-up visit, serum creatinine level, and ejection fraction are all likely influential in the classification process.

References

Breiman, Leo. 2001. “Random Forests.” Machine Learning 45: 5–32.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning, 2nd ed. Springer New York.
Lin, Ke, Yonghua Hu, and Guilan Kong. 2019. “Predicting in-Hospital Mortality of Patients with Acute Kidney Injury in the ICU Using Random Forest Model.” International Journal of Medical Informatics 125: 55–61.
Luo, Guoliang, Yean Zhu, Rui Wang, Yang Tong, Wei Lu, and Haolun Wang. 2020. “Random Forest–Based Classification and Analysis of Hemiplegia Gait Using Low-Cost Depth Cameras.” Medical & Biological Engineering & Computing 58: 373–82.
Rachmawanto, Eko Hari, De Rosal Ignatious Moses Setiadi, Nova Rijati, Ajib Susanto, Ibnu Utomo Wahyu Mulyono, and Hidayah Rahmalan. 2021. “Attribute Selection Analysis for the Random Forest Classification in Unbalanced Diabetes Dataset.” 2021 International Seminar on Application for Technology of Information and Communication (iSemantic), 82–86.
Zhao, Jing, Shaopeng Gu, and Adam McDermaid. 2019. “Predicting Outcomes of Chronic Kidney Disease from EMR Data Based on Random Forest Regression.” Mathematical Biosciences 310: 24–30.